attention stage
Pop-out vs. Glue: A Study on the pre-attentive and focused attention stages in Visual Search tasks
Beukelman, Hendrik, Rodrigues, Wilder C.
Success in visual search tasks depends on factors like awareness, cognitive abilities, and the nature of the search itself. Some studies have explored the complexities of visual search, focusing on asymmetry, where locating target A among distractors B is easier than finding B among A. Our research specifically examines the asymmetry between finding an oblique line among vertical lines and finding a vertical line among oblique lines. Anne Treisman's study (Treisman & Gelade, 1980) [3] found that certain features, like colour, are more easily detected than others, such as orientation. Further, Treisman & Gormican (1988) [4] showed that identifying a vertical target among oblique distractors took longer than identifying an oblique target among vertical distractors, which supports the idea that a basic feature enhances detection. We aim to replicate these findings with the following research question: Is there a search asymmetry between searching for an oblique target among vertical distractors and searching for a vertical target among oblique distractors? We anticipate a 'pop-out' effect when participants search for an oblique target among vertical distractors, suggesting a parallel search, as opposed to a serial search pattern in the reverse condition. Consistent with Treisman & Gormican's findings [4], we predict faster identification of oblique targets, aligning with the 'pop-out' effect, while vertical targets will require focused attention ('glue' phase), particularly as distractor numbers increase.
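The predicted contrast between parallel and serial search is commonly expressed as the slope of reaction time against the number of displayed items: a near-flat slope is consistent with pop-out, a steep slope with item-by-item search. The following is a minimal sketch of that comparison, not the authors' analysis. The function name `search_slope` and all reaction-time values are hypothetical and chosen only for illustration.

```python
from statistics import mean


def search_slope(set_sizes, mean_rts_ms):
    """Least-squares slope (ms per item) of mean reaction time against set size."""
    x_bar = mean(set_sizes)
    y_bar = mean(mean_rts_ms)
    sxx = sum((x - x_bar) ** 2 for x in set_sizes)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(set_sizes, mean_rts_ms))
    return sxy / sxx


# Hypothetical, made-up values for illustration only (NOT data from the study).
set_sizes = [4, 8, 16]
rt_oblique_target = [520.0, 524.0, 532.0]   # oblique target among vertical distractors
rt_vertical_target = [560.0, 640.0, 800.0]  # vertical target among oblique distractors

slope_oblique = search_slope(set_sizes, rt_oblique_target)
slope_vertical = search_slope(set_sizes, rt_vertical_target)

# A search asymmetry shows up as a clearly steeper slope in one condition.
print(f"oblique target:  {slope_oblique:.1f} ms/item")
print(f"vertical target: {slope_vertical:.1f} ms/item")
```

With real data, the two slopes would be estimated per participant and compared statistically rather than read off a single fit.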
AttentionHand: Text-driven Controllable Hand Image Generation for 3D Hand Reconstruction in the Wild
Park, Junho, Kong, Kyeongbo, Kang, Suk-Ju
Recently, there has been a significant amount of research on 3D hand reconstruction for use in various forms of human-computer interaction. However, 3D hand reconstruction in the wild is challenging due to an extreme lack of in-the-wild 3D hand datasets. In particular, when hands are in complex poses such as interacting hands, problems like appearance similarity, self-handed occlusion, and depth ambiguity make it more difficult. To overcome these issues, we propose AttentionHand, a novel method for text-driven controllable hand image generation. Since AttentionHand can generate a large number of varied in-the-wild hand images well-aligned with 3D hand labels, we can acquire a new 3D hand dataset and relieve the domain gap between indoor and outdoor scenes. Our method needs four easy-to-use modalities (i.e., an RGB image, a hand mesh image from a 3D label, a bounding box, and a text prompt). These modalities are embedded into the latent space in the encoding phase. Then, in the text attention stage, hand-related tokens from the given text prompt are attended to highlight hand-related regions of the latent embedding. After the highlighted embedding is fed to the visual attention stage, hand-related regions in the embedding are attended by conditioning on global and local hand mesh images with the diffusion-based pipeline. In the decoding phase, the final feature is decoded into new hand images, which are well-aligned with the given hand mesh image and text prompt. As a result, AttentionHand achieved state-of-the-art performance among text-to-hand image generation models, and the performance of 3D hand mesh reconstruction was improved by additionally training with hand images generated by AttentionHand.
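The idea behind the text attention stage, using hand-related tokens to highlight regions of the latent embedding, can be illustrated with a toy sketch. This is not AttentionHand's implementation. It assumes that a token-to-position attention map is already available, and the function name `highlight_latent`, the word list, and the scaling rule are all hypothetical choices made for illustration.

```python
def highlight_latent(latent, attn, tokens, hand_words):
    """Scale each latent position by the attention it receives from hand-related tokens.

    latent: list of feature vectors, one per spatial position
    attn:   attn[t][p] is the attention weight of token t on position p (assumed given)
    tokens: prompt tokens, aligned with the rows of attn
    hand_words: set of tokens treated as hand-related (hypothetical list)
    """
    hand_rows = [attn[t] for t, tok in enumerate(tokens) if tok in hand_words]
    if not hand_rows:
        return [list(vec) for vec in latent]  # no hand-related token: leave unchanged

    # Average the attention maps of the hand-related tokens.
    n_pos = len(latent)
    weights = [sum(row[p] for row in hand_rows) / len(hand_rows) for p in range(n_pos)]

    peak = max(weights)
    if peak == 0:
        return [list(vec) for vec in latent]

    # Normalise so the most attended position keeps its full magnitude.
    return [[x * (w / peak) for x in vec] for vec, w in zip(latent, weights)]


tokens = ["a", "hand", "waving"]
attn = [[0.5, 0.5], [0.2, 0.8], [0.5, 0.5]]  # toy attention map: 3 tokens x 2 positions
latent = [[1.0, 2.0], [3.0, 4.0]]            # toy latent: 2 positions x 2 channels
highlighted = highlight_latent(latent, attn, tokens, {"hand", "hands"})
```

In this toy case the second position, which the token "hand" attends to most, keeps its features, while the first position is attenuated.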
Power Law Graph Transformer for Machine Translation and Representation Learning
We present the Power Law Graph Transformer, a transformer model with well-defined deductive and inductive tasks for prediction and representation learning. The deductive task learns the dataset-level (global) and instance-level (local) graph structures in terms of learnable power law distribution parameters. The inductive task outputs the prediction probabilities using the deductive task output, similar to a transductive model. We trained our model on Turkish-English and Portuguese-English datasets from TED talk transcripts for machine translation and compared its performance and characteristics to those of a transformer model with scaled dot-product attention trained with the same experimental setup. We report BLEU scores of $17.79$ and $28.33$ on the Turkish-English and Portuguese-English translation tasks with our model, respectively. We also show how a duality between a quantization set and an N-dimensional manifold representation can be leveraged to transform between local and global deductive-inductive outputs using successive application of linear and non-linear transformations end-to-end.
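For reference, the baseline's scaled dot-product attention computes softmax(QK^T / sqrt(d_k)) V. The following is a minimal pure-Python sketch of that standard operation. It illustrates only the baseline attention, not the Power Law Graph Transformer's own mechanism, and the function names are chosen here for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Return softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out.append([sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))])
    return out


# A zero query scores all keys equally, so the output is the mean of the values.
result = scaled_dot_product_attention([[0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]])
```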